Using Tranformers Models to Score Splice-Junctions: Why, Which and How¶

Fine-Tuning DNAbert1, DNAbert2 and NT¶

Setting up the environment¶

conda create -n env_dnabert2 python=3.8 conda activate env_dnabert2 git clone https://github.com/Zhihan1996/DNABERT_2.git cd DNABERT_2/ python3 -m pip install -r requirements.txt python3 -m pip install scikit-learn pip uninstall triton

DGX Setup:¶

sudo docker pull nvcr.io/nvidia/pytorch:23.08-py3 sudo docker run --gpus all -it --rm -v local_dir:/raid/Fadel/SpliceUp/ nvcr.io/nvidia/pytorch:23.08-py3 test123! docker run --gpus all -it --rm -v /raid/Fadel/SpliceUp:/SpliceUp nvcr.io/nvidia/pytorch:23.08-py3 test123!

Fine-Tuning using GUE dataset¶

https://drive.google.com/file/d/1GRtbzTe3UXYF1oW27ASNhYX3SZ16D7N2/view

bash /data/SpliceUp/scripts/tune.sh /data/SpliceUp/datasets/GUE/splice/reconstructed/ /data/SpliceUp/tuned_models/ GUE 400

datasets = "/raid/Fadel/SpliceUp/datasets/RefSeq200/"

# Collecting all splice junctions even if they dont follow our stict rules, these junction will be 
# used to mask the genome with Ns in order to generat fake junction from the rest of the genome
# example for window=6
# Genome                |--------------------------- exon -----------------|              |---------- exon ----------|
# Genome after masking  |--------------------------------------------NNNNNNNNNNNN   NNNNNNNNNNNN---------------------| 

Generating RefSeq datasets¶

InĀ [Ā ]:
# Define file paths
path_fasta = "/data/references/GRCh38_RefSeq_p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz"
path_gtf = "/data/references/GRCh38_RefSeq_p14/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz"
output_path = "/data/DeepSAP/datasets/"
pickles_path = "/data/DeepSAP/pickles/"
dataset = "RefSeq"
max_n_junctions = 500000  # No capping

# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {}  # Dictionary to store genome sequences

# Read GTF file and extract gene, transcript, and exon information
genes_table = {}  # Dictionary to store gene annotations
transcripts_table = {}  # Dictionary to store transcript annotations
exons_table = {}  # Dictionary to store exon annotations

for window in [90, 150, 200, 400]:
    
    # Collect strictly annotated splice junctions (for training)
    strict_annotated_junctions_table = {}  # Dictionary to store strict junctions
    
    # Collect all splice junctions (for masking genome)
    not_strict_annotated_junctions_table = {}  # Dictionary to store non-strict junctions
    
    # Define number of synthetic junctions to create
    negative_dataset_factor = 1.0
    n_made_up_junctions = negative_dataset_factor * len(strict_annotated_junctions_table.keys())
    
    # Generate synthetic junction sequences from non-strict junctions
    made_up_junctions_table = {}  # Dictionary to store made-up junctions
    
    # Write tuning dataset files (train, eval, test) to output directory
    path = output_path + dataset + str(window)
    
    # Write dataset for prediction (dev set) after fine-tuning
    path = output_path + dataset + str(window) + "_PREDICT"
    
    # Save strict junctions table as a pickle file
    path = pickles_path + "strict_annotated_junctions_table__" + dataset + "__" + str(window) + ".pkl"
    
    # Save non-strict junctions table as a pickle file
    path = pickles_path + "not_strict_annotated_junctions_table__" + dataset + "__" + str(window) + ".pkl"

Generating GENCODE datasets¶

InĀ [Ā ]:
# Define file paths
path_fasta = "/data/references/GRCh38_v44/GRCh38.primary_assembly.genome.fa.gz"
path_gtf = "/data/references/GRCh38_v44/gencode.v44.primary_assembly.annotation.gtf.gz"
output_path = "/data/DeepSAP/datasets/"
pickles_path = "/data/DeepSAP/pickles/"
dataset = "GENCODE"
max_n_junctions = 500000  # No capping

# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {}  # Dictionary to store genome sequences

# Read GTF file and extract gene, transcript, and exon information
genes_table = {}  # Dictionary to store gene annotations
transcripts_table = {}  # Dictionary to store transcript annotations
exons_table = {}  # Dictionary to store exon annotations

for window in [90, 150, 200, 400]:
    
    # Collect strictly annotated splice junctions (for training)
    strict_annotated_junctions_table = {}  # Dictionary to store strict junctions
    
    # Collect all splice junctions (for masking genome)
    not_strict_annotated_junctions_table = {}  # Dictionary to store non-strict junctions
    
    # Define number of synthetic junctions to create
    negative_dataset_factor = 1.0
    n_made_up_junctions = (negative_dataset_factor + 0.5) * len(strict_annotated_junctions_table.keys())
    
    # Generate synthetic junction sequences from non-strict junctions
    made_up_junctions_table = {}  # Dictionary to store made-up junctions
    
    # Write tuning dataset files (train, eval, test) to output directory
    path = output_path + dataset + str(window)
    
    # Write dataset for prediction (dev set) after fine-tuning
    path = output_path + dataset + str(window) + "_PREDICT"
    
    # Save strict junctions table as a pickle file
    path = pickles_path + "strict_annotated_junctions_table__" + dataset + "__" + str(window) + ".pkl"
    
    # Save non-strict junctions table as a pickle file
    path = pickles_path + "not_strict_annotated_junctions_table__" + dataset + "__" + str(window) + ".pkl"

Generating MS150 dataset¶

Homo_sapiens (RefSeq)¶

https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz

Homo_sapiens (GENCODE)¶

https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz

Danio rerio¶

https://ftp.ensembl.org/pub/release-110/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.toplevel.fa.gz https://ftp.ensembl.org/pub/release-110/gtf/danio_rerio/Danio_rerio.GRCz11.110.gtf.gz

Mus_musculus:¶

https://ftp.ensembl.org/pub/release-110/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.toplevel.fa.gz https://ftp.ensembl.org/pub/release-110/gtf/mus_musculus/Mus_musculus.GRCm39.110.gtf.gz

Drosophila_melanogaster¶

https://ftp.ensembl.org/pub/release-110/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz https://ftp.ensembl.org/pub/release-110/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.110.gtf.gz

Caenorhabditis_elegans¶

https://ftp.ensembl.org/pub/release-110/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz https://ftp.ensembl.org/pub/release-110/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.110.gtf.gz

Rattus norvegicus¶

https://ftp.ensembl.org/pub/release-110/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz https://ftp.ensembl.org/pub/release-110/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.110.gtf.gz

Plasmodium falciparum¶

http://ftp.ensemblgenomes.org/pub/protists/release-57/fasta/plasmodium_falciparum/dna/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa.gz http://ftp.ensemblgenomes.org/pub/protists/release-57/gtf/plasmodium_falciparum/Plasmodium_falciparum.ASM276v2.57.gtf.gz

Plasmodium vivax¶

http://ftp.ensemblgenomes.org/pub/protists/release-57/fasta/plasmodium_vivax/dna/Plasmodium_vivax.ASM241v2.dna.toplevel.fa.gz http://ftp.ensemblgenomes.org/pub/protists/release-57/gtf/plasmodium_vivax/Plasmodium_vivax.ASM241v2.57.gtf.gz

Plasmodium berghei¶

http://ftp.ensemblgenomes.org/pub/protists/release-57/fasta/plasmodium_berghei/dna/Plasmodium_berghei.PBANKA01.dna.toplevel.fa.gz http://ftp.ensemblgenomes.org/pub/protists/release-57/gtf/plasmodium_berghei/Plasmodium_berghei.PBANKA01.57.gtf.gz

Plasmodium knowlesi¶

http://ftp.ensemblgenomes.org/pub/protists/release-57/fasta/plasmodium_knowlesi/dna/Plasmodium_knowlesi.ASM635v1.dna.toplevel.fa.gz http://ftp.ensemblgenomes.org/pub/protists/release-57/gtf/plasmodium_knowlesi/Plasmodium_knowlesi.ASM635v1.57.gtf.gz

InĀ [Ā ]:
# Define file paths
path_fasta = "/Users/zingo/Downloads/danio/Danio_rerio.GRCz11.dna.toplevel.fa.gz"
path_gtf = "/Users/zingo/Downloads/danio/Danio_rerio.GRCz11.110.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "DANIO"
max_n_junctions = 500000  # No capping

# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {}  # Dictionary to store genome sequences

# Read GTF file and extract gene, transcript, and exon information
genes_table = {}  # Dictionary to store gene annotations
transcripts_table = {}  # Dictionary to store transcript annotations
exons_table = {}  # Dictionary to store exon annotations

# Define window size
window = 150  

# Collect strictly annotated splice junctions (for training)
DANIO_strict_annotated_junctions_table = {}  # Dictionary to store strict junctions
InĀ [Ā ]:
# Define file paths
path_fasta = "/Users/zingo/Downloads/GRCm39/Mus_musculus.GRCm39.dna.toplevel.fa.gz"
path_gtf = "/Users/zingo/Downloads/GRCm39/Mus_musculus.GRCm39.110.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "MUS"
max_n_junctions = 500000  # No capping

# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {}  # Dictionary to store genome sequences

# Read GTF file and extract gene, transcript, and exon information
genes_table = {}  # Dictionary to store gene annotations
transcripts_table = {}  # Dictionary to store transcript annotations
exons_table = {}  # Dictionary to store exon annotations

# Define window size
window = 150  

# Collect strictly annotated splice junctions (for training)
MUS_strict_annotated_junctions_table = {}  # Dictionary to store strict junctions
InĀ [Ā ]:
# Define file paths
path_fasta = "/Users/zingo/Downloads/droso/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz"
path_gtf = "/Users/zingo/Downloads/droso/Drosophila_melanogaster.BDGP6.46.110.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "DROSO"
max_n_junctions = 500000  # No capping

# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {}  # Dictionary to store genome sequences

# Read GTF file and extract gene, transcript, and exon information
genes_table = {}  # Dictionary to store gene annotations
transcripts_table = {}  # Dictionary to store transcript annotations
exons_table = {}  # Dictionary to store exon annotations

# Define window size
window = 150  

# Collect strictly annotated splice junctions (for training)
DROSO_strict_annotated_junctions_table = {}  # Dictionary to store strict junctions
InĀ [Ā ]:
# Define file paths
path_fasta = "/Users/zingo/Downloads/elegans/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz"
path_gtf = "/Users/zingo/Downloads/elegans/Caenorhabditis_elegans.WBcel235.110.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "ELEGANS"
max_n_junctions = 500000  # No capping

# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {}  # Dictionary to store genome sequences

# Read GTF file and extract gene, transcript, and exon information
genes_table = {}  # Dictionary to store gene annotations
transcripts_table = {}  # Dictionary to store transcript annotations
exons_table = {}  # Dictionary to store exon annotations

# Define window size
window = 150  

# Collect strictly annotated splice junctions (for training)
ELEGANS_strict_annotated_junctions_table = {}  # Dictionary to store strict junctions
InĀ [Ā ]:
# Define file paths
path_fasta = "/Users/zingo/Downloads/rattus/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz"
path_gtf = "/Users/zingo/Downloads/rattus/Rattus_norvegicus.mRatBN7.2.110.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "RATTUS"
max_n_junctions = 500000  # No capping

# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {}  # Dictionary to store genome sequences

# Read GTF file and extract gene, transcript, and exon information
genes_table = {}  # Dictionary to store gene annotations
transcripts_table = {}  # Dictionary to store transcript annotations
exons_table = {}  # Dictionary to store exon annotations

# Define window size
window = 150  

# Collect strictly annotated splice junctions (for training)
RATTUS_strict_annotated_junctions_table = {}  # Dictionary to store strict junctions
InĀ [Ā ]:
# Define file paths
path_fasta = "/Users/zingo/Downloads/GRCh38_v44/GRCh38.primary_assembly.genome.fa.gz"
path_gtf = "/Users/zingo/Downloads/GRCh38_v44/gencode.v44.primary_assembly.annotation.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "GENCODE"
max_n_junctions = 500000  # No capping

# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {}  # Dictionary to store genome sequences

# Read GTF file and extract gene, transcript, and exon information
genes_table = {}  # Dictionary to store gene annotations
transcripts_table = {}  # Dictionary to store transcript annotations
exons_table = {}  # Dictionary to store exon annotations

# Define window size
window = 150  

# Collect strictly annotated splice junctions (for training)
GENCODE_strict_annotated_junctions_table = {}  # Dictionary to store strict junctions
InĀ [Ā ]:
# List of datasets and their corresponding variable names
datasets = {
    "GENCODE": GENCODE_strict_annotated_junctions_table,
    "DANIO": DANIO_strict_annotated_junctions_table,
    "MUS": MUS_strict_annotated_junctions_table,
    "DROSO": DROSO_strict_annotated_junctions_table,
    "ELEGANS": ELEGANS_strict_annotated_junctions_table,
    "RATTUS": RATTUS_strict_annotated_junctions_table
}

# Save each dataset as a pickle file
for species, junctions_table in datasets.items():
    path = os.path.join(pickles_path, f"{species}_strict_annotated_junctions_table__{dataset}__{window}.pkl")
    with open(path, 'wb') as handle:
        pickle.dump(junctions_table, handle, protocol=pickle.HIGHEST_PROTOCOL)
InĀ [Ā ]:
# Define file paths
path_fasta = "/Users/zingo/Downloads/GRCh38_RefSeq_p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz"
path_gtf = "/Users/zingo/Downloads/GRCh38_RefSeq_p14/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "MultiSpecies"
max_n_junctions = 500000  # No capping

# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {}  # Dictionary to store genome sequences

# Read GTF file and extract gene, transcript, and exon information
genes_table = {}  # Dictionary to store gene annotations
transcripts_table = {}  # Dictionary to store transcript annotations
exons_table = {}  # Dictionary to store exon annotations

# Define window size
window = 150  

# Collect strictly annotated splice junctions (for training)
strict_annotated_junctions_table = {}  # Dictionary to store strict junctions

# Add all other splice junctions from different species
for junction_db in junction_databases:
    strict_annotated_junctions_table.update(junction_db)

# Collect all splice junctions (for masking genome)
not_strict_annotated_junctions_table = {}  # Dictionary to store non-strict junctions

# Define number of synthetic junctions to create
negative_dataset_factor = 1.0
n_made_up_junctions = negative_dataset_factor * len(strict_annotated_junctions_table.keys())

# Generate synthetic junction sequences from non-strict junctions
made_up_junctions_table = {}  # Dictionary to store made-up junctions

# Write tuning dataset files (train, eval, test) to output directory
path = output_path + dataset + str(window)

# Write dataset for prediction (dev set) after fine-tuning
path = output_path + dataset + str(window) + "_TRUE"

# Save strict junctions table as a pickle file
path = pickles_path + "Multi_strict_annotated_junctions_table__" + dataset + "__" + str(window) + ".pkl"

# Save non-strict junctions table as a pickle file
path = pickles_path + "Multi_not_strict_annotated_junctions_table__" + dataset + "__" + str(window) + ".pkl"

print("______________________________________________________________________________________________________________________________\n")

Plotting evaluation metrics:¶

InĀ [1]:
from utility_transformers import parse_tuned, plot_tuning, plot_tuning_metric_per_window

path = "/data/SpliceUp/tuned_models/"

df = parse_tuned(path, level="eval")

plot_tuning(df)

# Save the plots
for window in [90, 200, 400, 150]:
    filtered_df = df.loc[df['Window'] == window]
    plot_tuning_metric_per_window(filtered_df, "/data/SpliceUp/DeepSAP-main/scripts/plots/tuning/tuning_" + f"{window}", window)
Processing	DNABERT1_k6__Splice__GENCODE150
Processing	DNABERT1_k6__Splice__GENCODE400
Processing	DNABERT1_k6__Splice__MultiSpecies150
Processing	DNABERT1_k6__Splice__PF_r57_90
Processing	DNABERT1_k6__Splice__PF_r57_150
Processing	DNABERT1_k6__Splice__PF_r57_200
Processing	DNABERT1_k6__Splice__PF_r57_400
Processing	DNABERT1_k6__Splice__RefSeq90
Processing	DNABERT1_k6__Splice__RefSeq150
Processing	DNABERT1_k6__Splice__RefSeq200
Processing	DNABERT1_k6__Splice__RefSeq400
Processing	DNABERT2__Splice__GENCODE150
Processing	DNABERT2__Splice__GENCODE400
Processing	DNABERT2__Splice__PF_r57_90
Processing	DNABERT2__Splice__PF_r57_150
Processing	DNABERT2__Splice__PF_r57_200
Processing	DNABERT2__Splice__PF_r57_400
Processing	DNABERT2__Splice__RefSeq90
Processing	DNABERT2__Splice__RefSeq150
Processing	DNABERT2__Splice__RefSeq200
Processing	DNABERT2__Splice__RefSeq400
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq90
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq90_wot_LoRA
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq150
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq150_wot_LoRA
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq200
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq200_wot_LoRA
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq400
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq400_wot_LoRA
InĀ [3]:
from utility_transformers import parse_tuned, plot_evaluating

path = "/data/SpliceUp/tuned_models/"
df = parse_tuned(path, level="test")

plot_evaluating(df, "/data/SpliceUp/DeepSAP-main/scripts/plots/evaluation/evaluating", metric='MCC')
df.to_csv("/data/SpliceUp/DeepSAP-main/scripts/plots/evaluation/evaluation.csv")
Processing	DNABERT1_k6__Splice__GENCODE150
Processing	DNABERT1_k6__Splice__GENCODE400
Processing	DNABERT1_k6__Splice__MultiSpecies150
Processing	DNABERT1_k6__Splice__PF_r57_90
Processing	DNABERT1_k6__Splice__PF_r57_150
Processing	DNABERT1_k6__Splice__PF_r57_200
Processing	DNABERT1_k6__Splice__PF_r57_400
Processing	DNABERT1_k6__Splice__RefSeq90
Processing	DNABERT1_k6__Splice__RefSeq150
Processing	DNABERT1_k6__Splice__RefSeq200
Processing	DNABERT1_k6__Splice__RefSeq400
Processing	DNABERT2__Splice__GENCODE150
Processing	DNABERT2__Splice__GENCODE400
Processing	DNABERT2__Splice__PF_r57_90
Processing	DNABERT2__Splice__PF_r57_150
Processing	DNABERT2__Splice__PF_r57_200
Processing	DNABERT2__Splice__PF_r57_400
Processing	DNABERT2__Splice__RefSeq90
Processing	DNABERT2__Splice__RefSeq150
Processing	DNABERT2__Splice__RefSeq200
Processing	DNABERT2__Splice__RefSeq400
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq90
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq90_wot_LoRA
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq150
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq150_wot_LoRA
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq200
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq200_wot_LoRA
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq400
Processing	NT_2.5B_850g_multi_species__Splice__RefSeq400_wot_LoRA